Abstract

Malware remains a serious problem for corporations,
government agencies, and individuals, as attackers
continue to use it as a tool to effect frequent and
costly network intrusions. Today malware detection
is still done mainly with heuristic and signature-based
methods that struggle to keep up with malware evolution.
Machine learning holds the promise of automating
the work required to detect newly discovered malware
families, and could potentially learn generalizations
about malware and benign software (benignware) that
support the detection of entirely new, unknown malware
families. Unfortunately, few proposed machine learning
based malware detection methods have achieved the
low false positive rates and high scalability required to
deliver deployable detectors.

In this paper we introduce an approach that addresses
these issues, describing in reproducible detail
the deep neural network based malware detection system
that Invincea has developed. Our system achieves
a usable detection rate at an extremely low false positive
rate and scales to real world training example volumes
on commodity hardware. Specifically, we show
that our system achieves a 95% detection rate at 0.1%
false positive rate (FPR), based on more than 400,000
software binaries sourced directly from our customers
and internal malware databases. We achieve these results
by directly learning on all binaries, without any
filtering, unpacking, or manually separating binary files
into categories. Further, we confirm our false positive
rates directly on a live stream of files coming in from
Invincea’s deployed endpoint solution, provide an estimate
of how many new binary files we expected to see
a day on an enterprise network, and describe how that
relates to the false positive rate and translates into an
intuitive threat score.

Our results demonstrate that it is now feasible to
quickly train and deploy a low resource, highly accurate
machine learning classification model, with false positive
rates that approach traditional labor intensive signature
based methods, while also detecting previously
unseen malware. Since machine learning models tend
to improve with larger data-sizes, we foresee deep neural
network classification models gaining in importance
as part of a layered network defense strategy in coming
years.


1. Introduction 

Malware continues to facilitate crime, espionage,
and other unwanted activities on our computer networks,
as attackers use malware as a key tool in their campaigns.
One problem in computer security is therefore
to detect malware, so that it can be stopped before it
can achieve its objectives, or at least so that it can be
expunged once it has been discovered.

In this vein, various categories of detection approaches
have been proposed, including, on the one
hand, rule or signature based approaches, which require
analysts to hand craft rules that reason over relevant data
to make detections, and, on the other hand, machine
learning approaches, which automatically reason about
malicious and benign data to fit detection model parameters.
A middle path between these two methods is the
automatic generation of signatures. To date, the computer
security industry, to our knowledge, has favored
manual and automatically created rules and signatures
over machine learning and statistical methods, because
of the low false positive rates achievable by rule and
signature-based methods.

In recent years, however, a confluence of three
developments has increased the possibility for success
in machine-learning based approaches, holding the
promise that these methods might achieve high detection
rates at low false positive rates without the burden
of human signature generation required by manual
methods.

The first of these trends is the rise of commercial
threat intelligence feeds that provide large volumes
of new malware, meaning that for the first time,
timely, labeled malware data are available to the security
community. The second trend is that computing
power has become cheaper, meaning that researchers
can more rapidly iterate on malware detection machine
learning models and fit larger and more complex models
to data. Third, machine learning as a discipline
has evolved, meaning that researchers have more tools
at their disposal to craft detection models that achieve
breakthrough performance in terms of both accuracy
and scalability.

In this paper we introduce an approach that takes
advantage of all three of these trends: a deployable deep
neural network based malware detector using static features
that gives what we believe to be the best reported
accuracy results of any previously published detection
engine.

The structure of the rest of this paper is as follows.
In Section 2 we describe our approach, giving a description
of our feature extraction methods, our deep neural
network, and our Bayesian calibration model. In Section 3
we provide multiple validations of our approach.
Section 4 treats related work, surveying relevant malware
detection research and comparing our results to
other proposed methods. Finally, Section 5 concludes
the paper, reiterating our findings and discussing plans
for future work.

2. Method 

Our full classification framework, shown in Fig. 1,
consists of three main components. The first component
extracts four different types of complementary features
from the static benign and malicious binaries. The
second component is our deep neural network classifier
which consists of an input layer, two hidden layers and
an output layer. The final component is our score calibrator,
which translates the outputs of the neural network
to a score that can be realistically interpreted as
approximating the probability that the file is actually
malware. In the remainder of this section we describe
each of these model components in detail.

2.1. Feature Engineering 

2.1.1. Byte/Entropy Histogram Features. The first 
set of features that we compute for input binaries are the 
bin values of a two-dimensional byte entropy histogram 
that models the file’s distribution of bytes. 

To extract the byte entropy histogram, we slide a
1024 byte window over an input binary, with a step size
of 256 bytes. For each window, we compute the base-2
entropy of the window, and pair each individual byte
occurrence in the window (1024 non-unique values) with this
computed entropy value, storing the 1024 pairs in a list.

Figure 1. Outline of our malware classification
framework.

Finally, we compute a two-dimensional histogram
over the pair list, where the histogram entropy axis has
sixteen evenly sized bins over the range [0,8], and the
byte axis has sixteen evenly sized bins over the range
[0,255]. To obtain a feature vector (rather than the matrix),
we concatenate each row vector in this histogram
into a single, 256-value vector.
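
The windowing and binning steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `byte_entropy_histogram`, the treatment of files shorter than one window, and the choice of entropy as the row axis are all assumptions made here.

```python
import numpy as np

def byte_entropy_histogram(data: bytes, window: int = 1024, step: int = 256) -> np.ndarray:
    """Return a flattened 16x16 entropy/byte histogram (256 values)."""
    buf = np.frombuffer(data, dtype=np.uint8)
    hist = np.zeros((16, 16), dtype=np.int64)  # rows: entropy bins (assumed), cols: byte bins
    if buf.size == 0:
        return hist.ravel()
    # Assumption: a file shorter than one window is treated as a single window.
    for start in range(0, max(buf.size - window, 0) + 1, step):
        chunk = buf[start:start + window]
        counts = np.bincount(chunk, minlength=256)
        probs = counts[counts > 0] / chunk.size
        entropy = float(-np.sum(probs * np.log2(probs)))  # base-2 entropy, in [0, 8]
        # 16 even bins over [0, 8]; an entropy of exactly 8 falls in the last bin.
        e_bin = min(int(entropy * 2), 15)
        # 16 even bins over [0, 255] correspond to the high nibble of each byte.
        b_bins = chunk >> 4
        # Every byte in the window is paired with the window's entropy.
        hist[e_bin] += np.bincount(b_bins, minlength=16)
    return hist.ravel()
```

For example, `byte_entropy_histogram(open(path, "rb").read())` would yield the 256-value vector for one file, where `path` is a placeholder for a binary on disk.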

Our intuition in using these features is to model the 
contents of input files in a file-format agnostic way. We 
have found that in practice, the effect of representing 
byte values in the entropy “context” in which they occur 
separates byte values that occur in the context of, for 
example, x86 instruction data from, for example, byte 
values occurring in compressed data. 

2.1.2. PE Import Features. The second set of features
that we compute for input binaries are derived from the
input binary’s import address table. We initialize an array
of 256 integers to zero; extract the import address
table from the binary program; hash each tuple of DLL
name and import function into the range [0,255]; and
increment the associated counter in our feature array.
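
The hashing step can be illustrated with a short sketch. The text does not name the hash function, so the use of SHA-256, the lowercasing of the DLL name, and the function name `import_features` are assumptions for illustration only; the input is assumed to be a list of (DLL name, function name) tuples already extracted from the import address table.

```python
import hashlib

def import_features(imports, n_bins=256):
    """Count hashed (dll, function) tuples in a fixed-size array."""
    counts = [0] * n_bins
    for dll, func in imports:
        # Assumption: DLL names are compared case-insensitively.
        key = f"{dll.lower()}:{func}".encode("utf-8")
        digest = hashlib.sha256(key).digest()
        # Map the hash into the range [0, n_bins - 1] and increment that counter.
        counts[int.from_bytes(digest[:4], "big") % n_bins] += 1
    return counts
```

Because distinct imports can collide in the same bin, the counts are a lossy summary; the fixed output size is what keeps the feature space bounded.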

Our intuition is that import table DLLs may help
our model to capture the semantics of the external function
calls that a given input binary relies upon, thereby
detecting either heuristically suspicious files or files
with a combination of imports that match a known malware
family. By hashing the potentially large number
of imported functions into a small array, we ensure that
our feature space remains fixed-size, which is important
for scalability. In practice we find that even with a 256
valued hash function our neural network learns a meaningful
separation between malware and benignware, as
shown later in our evaluation.

2.1.3. PE Metadata Features. The final set of features
are derived from the numerical fields extracted
from the target binary’s portable executable (PE) packaging.
The portable executable format is the standard format
for executables on Windows-family operating systems.
To extract these features we extract numerical
portable executable fields from the binary using the
“pefile” Python parsing library. Each of these fields
has a textual name (e.g., “compile_timestamp”), which,
similar to the import table, we aggregate into a 256-length
array.

Our motivation for extracting these features is to
give our model the opportunity to identify both heuristically
suspicious aspects of a given binary program’s
packaging, and allow it to learn signatures that capture
individual malware families.

2.1.4. Aggregation of Feature Types. To construct
our final feature vector, we take the four 256-dimensional
feature vectors described above and concatenate
them into a single, 1024-dimensional feature
vector. We found, throughout the course of our research,
that this reduction of data into an a priori fixed-size
small vector resulted in only a minor degradation
in the accuracy of our model, and allowed us to dramatically
reduce the memory and CPU time necessary
to load and train our model, as compared to the more
common approach of assigning each categorical value
to its own column in the feature vector.

2.1.5. Labeling. To train and evaluate our model at
low false positive rates, we require accurate labels for
our malware and benignware binaries. We accomplish
this by running all of our data through VirusTotal, which
runs the binaries through approximately 55 anti-virus
engines. We then use a voting strategy to decide if each file
is either malware or benignware.

Similar to [?], we label any file against which 30%
or more of the anti-virus engines alarm as malware, and
any file that no anti-virus engine alarms on as benignware.
For the purposes of both training and accuracy
evaluation we discard any file that more than 0% and
less than 30% of VirusTotal’s anti-virus engines declare
to be malware, given the uncertainty surrounding the
nature of these files. Note that we do not filter our binary
files based on any actual content, as this could bias our
results.
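
The voting rule above can be written as a small function. This is a sketch only; the function name, the string labels, and the use of `None` for discarded files are illustrative choices rather than part of the described system.

```python
def label_from_votes(n_detections, n_engines, malware_threshold=0.30):
    """Apply the voting rule: no detections -> benignware, >= 30% -> malware, else discard."""
    if n_engines <= 0:
        raise ValueError("n_engines must be positive")
    if n_detections == 0:
        return "benignware"
    if n_detections / n_engines >= malware_threshold:
        return "malware"
    return None  # ambiguous file, excluded from training and evaluation
```

With 55 engines, a file needs at least 17 detections to be labeled malware, and a file with between 1 and 16 detections is dropped.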


2.2. Neural Network 

For classification, we use a deep feedforward neural
network consisting of four layers, where each of the first
three 1024 node layers consists of dropout [?] followed
by a dense layer with either a parametric rectified
linear unit (PReLU) activation function [?] in the
first two layers, or the sigmoid function in the last hidden
layer (the fourth layer being the prediction). We
elaborate on the reasoning behind these choices below.

2.2.1. Design. First, our choice of using a deep neural
network, rather than a shallow but wide neural network,
is based on the developed understanding that deep architectures
can be more efficient (in terms of number
of fitting parameters) than shallow networks [?]. This is
important in our case, since the number of binary samples
in our dataset is still relatively small, as compared
to all the possible binaries that can be observed on a large
enterprise network, and so our sampling of the feature
space is limited. Our goal was to increase the expressiveness
of the network, while maintaining a tractable size
network that can be quickly trained on a single Amazon
EC2 node. Given our four layer neural network design,
the remaining design choices are meant to address
overfitting and improve the efficacy of the backpropagation
algorithm.

2.2.2. Preventing Overfitting. Dropout has been
demonstrated to be a very simple and efficient approach
for preventing overfitting in deep neural networks.
Unlike standard weight regularizers, such as those based on
the ℓ1 or ℓ2 norms, that push the weights toward some
expected prior distribution [?], dropout seeks weights
at each node that are complementary to weights in
other nodes. The dropout solution is potentially more
resilient to imperfect or dirty data (which is common
when extracting features from similar malware that was
compiled or packed using different software), since it
discourages co-adaptation by creating multiple paths
to correct classification throughout the network. This
can be viewed as implicit bagging of several neural
network models [?].

2.2.3. Speeding Up Learning. Rectified linear units
(ReLU) have been shown to significantly speed up network
training over traditional sigmoidal activation functions,
such as tanh [25], by avoiding significant decay
in the gradient descent convergence rate after an initial
set of iterations. This slowdown is due to saturating
non-linearities in sigmoidal functions at their edges
[?]. Using ReLU activation functions can also
lead to bad performance when the input values are below
0, and PReLU activation functions are made to dynamically
adjust in order to avoid this issue, thus yielding
a significantly improved convergence rate [?].

Initialization of weights, before training, can significantly
impact the convergence of the backpropagation
algorithm [?]. The goal of a good initialization
is to avoid the multiplicative impact of weight aggregation
from multiple layers during backpropagation. In
our approach we use the Gaussian distribution that is
normalized based on the size of the input and output of
the layers, as suggested in [?]. Before doing this initialization
we transform our feature values by applying
the base-10 logarithm to each feature value, which we
found in practice improved training performance substantially.

2.2.4. Formal Description. Let $l \in \{0,1,2,3\}$ be a
layer in the network, $d^{(l)}$ the incoming values into
the layer (for $l = 1$ those are the feature values),
$y^{(l)}$ the output values of the layer, $W^{(l)}$ the weights
of the layer that linearly transform n input values into
m output values, $b^{(l)}$ the bias, and $F^{(l)}$ the associated
activation vector function. The equations for
$l \in \{1,2,3\}$ of the network are

$$
\begin{aligned}
d^{(l)} &= y^{(l-1)} \cdot r^{(l)}, \\
z^{(l)} &= W^{(l)} d^{(l)} + b^{(l)}, \\
y^{(l)} &= F^{(l)}(z^{(l)}),
\end{aligned}
\tag{1}
$$

where $\cdot$ is a pointwise (elementwise) vector product,
and the entries of $r^{(l)}$ are independent samples from a
Bernoulli distribution with parameter h. The r values are
resampled for each batch update during training, and h
corresponds to the fraction of nodes that are kept during
each batch update [?]. Layer l = 0 is the input layer, and
l = 4 is the output layer.

For layers $l \in \{1,2\}$, the activation function is the
PReLU function,

$$
F^{(l)}(z^{(l)}) = (y^{(l)}_1, \ldots, y^{(l)}_i, \ldots, y^{(l)}_m), \tag{2}
$$

where for some additional parameter $a^{(l)}_i$ that is also fit
during training,

$$
y^{(l)}_i =
\begin{cases}
z^{(l)}_i & \text{if } z^{(l)}_i \ge 0, \\
a^{(l)}_i z^{(l)}_i & \text{else.}
\end{cases}
\tag{3}
$$

For the final layer l = 3, the activation function is the
sigmoid function,

$$
F^{(3)}(z) = \frac{1}{1 + e^{-z}}, \tag{4}
$$

which produces the output of our model.

The loss for each n sized batch sample is evaluated
as the sum of the cross-entropy between the neural net’s
prediction and the true label,

$$
\mathcal{L}(y^{*}, \tilde{y}) = -\sum_{j=1}^{n} \left[ \tilde{y}_j \log y^{*}_j + (1 - \tilde{y}_j) \log (1 - y^{*}_j) \right], \tag{5}
$$

where $y^{*}$ is the output of our model for all n batch samples,
$y^{*}_j$ is the output for sample j, and $\tilde{y}_j \in \{0,1\}$ is the
true label of the sample j, with 0 representing benignware
and 1 malware.

The neural network is trained using backpropagation
and the Adam gradient-based optimizer [?], which
we observed to converge significantly faster than the
standard stochastic gradient descent.
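
The layer equations and the batch loss can be sketched directly in NumPy. This is an illustrative forward pass only, under several assumptions: the helper names and the `params` layout are invented here, the final layer is assumed to map to a single output, and any rescaling of activations at prediction time to compensate for dropout is left out because the equations above do not cover it. Training (backpropagation with Adam) is not shown.

```python
import numpy as np

def prelu(z, a):
    # PReLU: identity for non-negative inputs, learned slope `a` otherwise.
    return np.where(z >= 0, z, a * z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params, keep_prob=1.0, rng=None):
    """Forward pass; params is a list of dicts with W (m x n), b (m,), and a (m,) for PReLU layers."""
    y = x
    for layer, p in enumerate(params, start=1):
        if rng is not None and keep_prob < 1.0:
            # Dropout mask: Bernoulli samples with keep probability h = keep_prob.
            r = rng.binomial(1, keep_prob, size=y.shape)
            d = y * r
        else:
            d = y
        z = d @ p["W"].T + p["b"]
        # Sigmoid on the last layer, PReLU on the earlier ones.
        y = sigmoid(z) if layer == len(params) else prelu(z, p["a"])
    return y

def cross_entropy(y_pred, y_true, eps=1e-12):
    # Summed binary cross-entropy over the batch; clipping avoids log(0).
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(-np.sum(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)))
```

Passing a NumPy random generator as `rng` with `keep_prob < 1` applies the dropout mask, which corresponds to a training-time pass; omitting it gives a deterministic pass.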

2.3. Bayesian Calibration 

Beyond simply detecting malware in a binary sense
our system also seeks to provide users with accurate
probabilities that a given file is malware. We do this
through a Bayesian model calibration approach which
combines our empirical belief about the “riskiness” of
a given customer network (represented as our belief
about the ratio of malware to benignware on the customer’s
network) and empirical information about our
neural network’s error profile against test data. Here we
describe our specific approach for adjusting the classifier’s
“probability” score to reflect the true “threat”
score, given this qualitatively assumed ratio of malware
to benignware.

Let 0 ≤ s ≤ 1 be some score given by the classifier,
reflecting the degree to which a classifier believes
an observed binary is malware, with 0 being completely
benign, and 1 being certainly malware. Our goal is to
translate this number into a “threat” score, which will
give the user a measure of how likely it is that the observed
binary is actually malware. In line with this intuition,
we define the threat score as the probability that the file
will actually be malware, P(C = m|S = s), given the
score s, and category C ∈ {m, b}. We will use capital
P for probability, and lowercase p to represent a probability
density function (pdf), and for brevity drop the equality
sign.

Let us assume we have pdfs for the benign and malware
scores for a given classifier, p(S = s|C = m) and
p(S = s|C = b). We will describe in the next section how
we derive these pdfs from observed test data. Given a
base rate r, the ratio of malware to benignware, we will
now derive how to compute the threat score. This will be
done in two steps: i) we express our problem in terms
of our classifier’s expected pdf for benign and malware
predictions, and ii) we demonstrate how to practically
compute these pdfs.



2.3.1. Threat Score. Using Bayes’ rule we have

$$
P(m \mid s) = \frac{p(s \mid m) P(m)}{p(s)}. \tag{6}
$$

Rewriting p(s) as the sum of probabilities over the two
possible labels, we get

$$
P(m \mid s) = \frac{p(s \mid m) P(m)}{p(s \mid m) P(m) + p(s \mid b) P(b)}. \tag{7}
$$

Finally, using the constraint that probabilities add
up to 1, gives us the final value of the threat score
in terms of pdfs and probability of observing malware
(malware base rate),

$$
P(m \mid s) = \frac{p(s \mid m) P(m)}{p(s \mid m) P(m) + p(s \mid b) (1 - P(m))}. \tag{8}
$$
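
As a numerical illustration of the final threat score formula, the sketch below evaluates it for given density values and a malware prior. The function names are illustrative, and `prior_from_ratio` assumes that a malware-to-benignware ratio r corresponds to P(m) = r / (1 + r).

```python
def prior_from_ratio(ratio):
    """Convert a malware:benignware ratio r into the prior P(m) = r / (1 + r)."""
    return ratio / (1.0 + ratio)

def threat_score(p_s_given_m, p_s_given_b, p_malware):
    """Posterior P(m | s) from the two class-conditional densities and the prior P(m)."""
    numerator = p_s_given_m * p_malware
    denominator = numerator + p_s_given_b * (1.0 - p_malware)
    if denominator == 0.0:
        raise ValueError("both densities are zero at this score")
    return numerator / denominator
```

When the two densities are equal at a score s, the threat score reduces to the prior, so a low base rate pulls the score down unless the malware density clearly dominates.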


2.3.2. Probability Density Function Estimation.
Given the above definition of the threat score, we need
to derive the pdfs p(s|m) and p(s|b). There are two approaches
that are commonly used: i) the parametric approach,
where we assume some distribution for the pdfs,
and fit the parameters of that distribution based on the
observed samples, and ii) the non-parametric approach,
like the kernel density estimator (KDE), where we approximate
the value of the pdf given C by taking a weighted
average of the neighborhood.

Since it is not reasonable to expect the output of our
ML classifier to follow some standard distribution, we
used KDE with the Epanechnikov kernel [?]. In our
testing it had a better validation score than the standard
Gaussian kernel. Since the pdfs are only supported on
[0,1], we mirrored our samples to the left of 0 and the
right of 1, before computing the estimated pdf value at
a specific point. The window size was set empirically to
0.01 to better approximate the tail end of distributions,
where samples are less dense.
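
A mirrored Epanechnikov KDE of this kind can be sketched as follows. This is an assumption-laden illustration, not the authors' code: the function name is invented, "window size" is interpreted as the kernel bandwidth, and mirroring is implemented by reflecting every sample about 0 and about 1.

```python
import numpy as np

def epanechnikov_kde(samples, x, bandwidth=0.01):
    """Estimate a pdf on [0, 1] at x from classifier scores, with boundary mirroring."""
    samples = np.asarray(samples, dtype=float)
    # Reflect the samples about 0 and 1 so kernel mass near the edges is not lost.
    mirrored = np.concatenate([samples, -samples, 2.0 - samples])
    u = (np.asarray(x, dtype=float)[..., None] - mirrored) / bandwidth
    # Epanechnikov kernel: 0.75 * (1 - u^2) on |u| <= 1, zero elsewhere.
    kernel = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
    # Normalize by the number of original samples so the estimate integrates to 1 on [0, 1].
    return kernel.sum(axis=-1) / (samples.size * bandwidth)
```

Evaluating this once on malware test scores and once on benignware test scores gives the two density values needed by the threat score.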


3. Evaluation 


We evaluated our system in two ways. First, we
used our in-house database of malicious and benign binaries
to conduct a set of cross-validation experiments
testing how well our system performs using the individual
feature sets described above and the agglomeration
of the feature sets described above. Second, we
used a live feed of binaries from Invincea customer networks,
in conjunction with a live feed of malicious binaries
from the Jotti subscription threat intelligence feed,
to measure the accuracy of our system in deployment
contexts using all feature sets.

All our experiments were run on an Amazon EC2
g2.8xlarge instance, which has 60GB of RAM and four
1,536 CUDA core graphical processing units, of which
we only used one. The software uses the Keras v0.1.1
deep learning library to implement the neural network
model described above. The feature extraction is mostly
written in Cython and Python, heavily relying on the SciPy
and NumPy libraries, and each sample’s features are extracted
by a single thread process. Below we describe
each of these evaluations in detail, starting with a description
of our evaluation datasets and then moving on
to descriptions of our methodology and results.

3.1. Dataset 

Our benign and malware binaries were drawn from
Invincea’s own computer systems and Invincea’s customers’
networks. We used malicious binaries obtained
from both the Jotti commercial malware feed and from
Invincea’s private malware database. Our final dataset,
after VirusTotal filtering, contains 431,926 binaries,
with 81,910 labeled as benignware and 350,016 as malware.
Fig. 2 shows counts for the top malware families,
as identified by the Kaspersky anti-virus engine, in
our malware dataset. Fig. 3 gives a histogram over the
compile timestamps of both the malicious and benign
binaries in our dataset.

Not all malware binaries discovered by the security
community are shared publicly, or as part of threat
intelligence feeds, and the distribution of malware binaries
that a specific enterprise network might experience
could differ somewhat from our dataset. However, it
is significantly more critical to have an accurate distribution
of the benign files that reflects real usage, since
that is what will drive the critical FPR estimation. Since
the primary source of benign data comes directly from
Invincea’s clients, we believe it is one of the most accurate
representations of the true distribution that has been
evaluated. In our validation we confirm that our ROC
curve FPR estimates match closely the live stream FPR
(under the assumption that Invincea’s endpoint stream
contains little to no malware).

3.2. Cross-Validation Experiment 

We conducted five separate 4-fold cross-validation
experiments, where for each experiment we randomly
split our data into four equally sized partitions. For each
of the four partitions, we trained against the other three
partitions and tested against the fourth.

The first set of cross-validation experiments measured
our system’s accuracy for each of the four feature
types described in Section 2.1 individually. For these
experiments we reduced the size of the neural network
input layer to 256 and trained our network for 200
epochs or until the training error fell below 0.02 for each
fold. Our fifth experiment was conducted in the same
way, but included all of our features and used the 1024
neuron input layer. The results of these validations are
shown in Fig. 4 A, B, C, D, and E, and are also summarized
in Table 1. These validation results show there
is significant variation in how each of our feature sets
performs. Using all of the features together produced
the best results, with an average of 95.2% of malicious
binaries not seen in training detected at a 0.1% false
positive rate, with an area under the ROC curve (AUC) of
0.99964. Fig. 5 shows that our ROC improves with the
number of epochs, and we are not suffering from overfitting.
For our full dataset, it takes approximately 15 seconds to
train one epoch using a single GPU, so the full model
can be effectively trained in about 40 minutes.

Figure 4. Six experiments showing the accuracy of our approach for different combinations of feature
types. For each experiment we show a set of solid lines, which are the ROC curves of the individual
cross-validation folds, and a dotted line, which is the averaged value of these ROC curves. A. all four
feature types; B. only PE import features; C. only byte/entropy features; D. only metadata features; E. only
string features; F. all features after we train only on the samples whose compile timestamp is before
July 31st, 2014 and test on samples whose compile timestamp is after July 31st, 2014, excluding
samples with blatantly forged compile timestamps.

In terms of independent feature sets, our PE metadata
features perform best, with close to 87% of malware
binaries unseen in training detected at a 0.1% false
positive rate. Our string features also perform quite
well, with 69% of unseen malware detected at a 0.1%
false positive rate. While our byte-entropy features and
import features don’t perform as well as our PE metadata
and string features, we found that they boost accuracy
beyond what string and PE metadata features can
provide on their own.

Table 1. Estimated TPR at 0.1% FPR and AUC,
for the corresponding plots in Fig. 4.

Features              TPR      AUC
A. All                95.2%    0.99964
B. PE Import          22.8%    0.95785
C. Byte/entropy       61.1%    0.99145
D. Metadata           86.7%    0.99912
E. Strings            68.8%    0.99581
F. All (Time Split)   67.7%    0.99794
Figure 2. Counts of the top 20 malware families
in our experimental dataset.

Figure 3. Normalized histogram of compile
timestamps (by year) for our malware (left, red) and
benignware (right, teal) datasets based on the
Portable Executable compile timestamp field.

3.3. Expected Deployment Performance 

Figure 5. Plot showing the performance of
our full 1024 feature model, as a function of
training epochs, for each split of our cross-validation
experiment.

One important question, if our classifier is to be deployed,
is how to relate the cross-validated ROC to the
expected performance in an enterprise setting. To estimate
expected performance, we observed the number of previously
unseen binaries that were detected for the entire
set of customers during a span of a few days. This
gave us an expected average of around 5 previously unseen
executed binaries per endpoint, per day. For an FPR
of 0.1%, this would result in about five false positives
per day, per 1000 endpoints, assuming our ROC curve
is an accurate estimate of actual performance. We have
confirmed this result by directly running our sensor on
incoming data (endpoint binaries and Jotti stream) over
several days, which yielded a similar performance estimate.
Interestingly, some of the top false positives
were anti-virus installers and dubious tools used by
Invincea’s product development team.
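
The false positive estimate above is a simple product, shown here as a small sketch. The function name is illustrative, and the calculation assumes that essentially all previously unseen binaries on the endpoints are benign.

```python
def expected_false_positives_per_day(new_binaries_per_endpoint, endpoints, fpr):
    """Expected daily false positives, assuming all new binaries are benign."""
    return new_binaries_per_endpoint * endpoints * fpr

# 5 new binaries per endpoint per day, 1000 endpoints, 0.1% FPR
estimate = expected_false_positives_per_day(5, 1000, 0.001)
```

With these inputs the estimate is 5 x 1000 x 0.001, i.e. about five false positives per day across 1000 endpoints, matching the figure quoted in the text.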

3.4. Time Split Experiment 

A shortcoming of the standard cross-validation experiments
is that they do not separate our ability to detect
slightly modified malware from our ability to detect
new malware toolkits and new versions of existing malware
toolkits. That is because most unique malware binaries,
as defined by cryptographic checksums, are just
automatically generated copies of various metamorphic
malware designed to help it evade signature-based detection.

Thus, to better estimate our system’s ability to detect
genuinely new malware toolkits, or new versions of
existing malware toolkits, we ran a time split experiment.
We first extracted the compile timestamp field from each binary
executable in our dataset. Next we excluded binaries
that had compile timestamps after July 31st, 2015 (the
date on which the experiment was run) and binaries with
compile timestamps before January 1st, 2000, since for
those samples either the authors have blatantly tampered
with the binaries’ compile timestamps or the
file is corrupted. Finally, we split the remaining data into two
sets: a set of binaries with compile timestamps before
July 31st, 2014, and the set of binaries on or after July
31st, 2014. Then, we trained our neural network (using
all of the features described above) on the earlier
dataset and tested on the later dataset. While we cannot
completely trust that malware authors do not often
modify the compile timestamps on their binaries, there
is little motivation for doing so, and the distribution of
dates matches what we know about our dataset sources,
supporting that hypothesis.
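
The filtering and splitting procedure can be sketched as below. This is an illustration under stated assumptions: samples are taken to be (name, compile timestamp) pairs with the timestamp already parsed into a `datetime`, comparisons are made at day granularity, and the function and parameter names are invented here.

```python
from datetime import date, datetime

def time_split(samples, split=date(2014, 7, 31),
               earliest=date(2000, 1, 1), latest=date(2015, 7, 31)):
    """Split (name, compile_timestamp) pairs into earlier (train) and later (test) sets."""
    train, test = [], []
    for name, timestamp in samples:
        day = timestamp.date()
        # Drop implausible timestamps (likely forged or from a corrupted file).
        if day < earliest or day > latest:
            continue
        # Before the split date -> training set; on or after -> test set.
        (train if day < split else test).append(name)
    return train, test

train_names, test_names = time_split([
    ("a.exe", datetime(1999, 12, 31)),
    ("b.exe", datetime(2010, 5, 1)),
    ("c.exe", datetime(2014, 7, 30)),
    ("d.exe", datetime(2014, 7, 31)),
    ("e.exe", datetime(2015, 8, 1)),
])
```

In this toy input, `a.exe` and `e.exe` fall outside the accepted date range and are dropped, while the remaining three are divided around the split date.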

The results of this experiment, as shown in Fig. 4F,
demonstrate that our system performs substantially
worse on this test, detecting 67.7% of malware at a
0.1% FPR, and approaching a 100% detection rate at
a 1% FPR. The substantial degradation in performance
is not surprising given the difficulty of detecting genuinely
novel malware programs versus detecting malware
samples that are new instances of known malicious
platforms and toolkits. This result suggests that
the classifier should be updated often using new data in
order to maintain its classification accuracy. This, however,
can be done rapidly and cheaply for a neural network
classifier.

4. Related Work 

Malware detection has evolved over the past several
years, due to the growing threat posed
by malware to large corporations and governmental
agencies. Traditionally, the two major approaches for
malware detection can be roughly split based on the approach
that is used to analyze the malware, either static
or dynamic analysis (see review [?]). In static analysis
the malware file, or set of files, is either directly analyzed
in binary form, or additionally unpacked and/or
decompiled into an assembly representation. In dynamic
analysis, the binary files are executed, and the actions
are recorded through hooking or some access into the internals
of the virtualization environment.

In principle, dynamic detection can provide direct
observation of malware action, is less vulnerable to obfuscation
[28], and makes it harder to recycle existing
malware. However, in practice, automated execution of
software is difficult, since malware can detect if it is
running in a sandbox, and prevent itself from performing
malicious behavior. This resulted in an arms race
between dynamic behavior detectors using a sandbox
and malware [?]. Further, in a significant number of
cases, malware simply does not execute properly, due
to missing dependencies or unexpected system configuration.
These issues make it difficult to collect a large
clean dataset of malware behavior.

Static analysis, on the other hand, while vulnerable
to obfuscation, does not require elaborate or expensive
setup for collection, and very large datasets can be
created by simply aggregating the binary files. Accurate
labels can be computed for all these files using
anti-virus aggregator sites like VirusTotal [?].

This makes static analysis very amenable to machine
learning approaches, which tend to perform better
as data size increases [?]. Machine learning has been
applied to malware detection at least since [23], with
numerous approaches since (see reviews in [?]). Machine
learning consists of two parts, the feature engineering,
where the author transforms the input binary
into a set of features, and a learning algorithm, which
builds a classifier using these features.

Numerous static features have been proposed for
binaries: printable strings [29],
import tables [?], byte n-grams [?], opcodes [?], and
informational entropy [?]. An assortment of features has
also been suggested during the Kaggle Microsoft Malware
Classification Challenge [?], such as opcode images,
various decompiled assembly features, and aggregate
statistics. However, we are not aware of any published
methods that break the file into subsamples (e.g.,
using sliding windows), and create a histogram of all
the file’s subsamples based on two or more properties
of the individual subsample.

Potentially the feature space can become large; in
those cases methods like locality-sensitive hashing [20, ?],
feature hashing (aka the hashing “trick”) [?], or
random projections [22, 20, ?] have been used in
malware detection.

The large number of features, even after dimensionality
reduction, can cause scalability issues for
some machine learning algorithms. For example, non-linear
SVM kernels require O(N^2) multiplications during
each iteration of the optimizer, where N is the number
of samples [?]. k-Nearest Neighbors (k-NN) requires
finding the k closest neighbors in a potentially large
database of high dimensional data, during prediction,
which requires significant computation and storage of
all labeled samples.

One popular alternative to the above is an ensemble
of trees (boosted trees or bagged trees), which can scale
fairly efficiently by subsampling the feature space during
each iteration [?]. Decision trees can adapt well
to various data types, and are resilient to heterogeneous
scales of values in feature vectors, so they exhibit good
performance even without some type of data standardization.
However, standard implementations typically
do not allow incremental learning, and fitting the full
dataset with a large number of features could potentially
require expensive hardware.

Recently, neural networks have emerged as a scalable
alternative to the standard machine learning approaches,
due to significant advances in training algorithms
[?, 26]. Neural networks have been previously
used in malware detection [?], though it is not
clear how to compare results, since datasets are different,
in addition to the various pre-filtering of samples
that is done before evaluation.


5. Conclusion 

In this paper we introduced a deep learning based 
malware detection approach that achieves a detection 
rate of 95% and a false positive rate of 0.1% over an 
experimental dataset of over 400,000 software binaries. 
Additionally, we have shown that our approach requires 
modest computation to perform feature extraction and 
that it can achieve good accuracy over our corpus on a 
single GPU within modest timeframes. 

We believe that the layered approach of deep neural
networks and our two dimensional histogram features
provide an implicit categorization of binary types, allowing
us to directly train on all the binaries, without
separating them based on internal features, like packer
types, and so on.

Neural networks also have several properties that
make them good candidates for malware detection.
First, they can allow incremental learning; thus, not
only can they be trained in batches, but they can also be
retrained efficiently (even on an hourly or daily basis),
as new training data is collected. Second, they allow
us to combine labeled and unlabeled data, through pre-training
of individual layers [?]. Third, the classifiers
are very compact, so prediction can be done very
quickly using low amounts of memory.

Further attesting to the value of our approach, our
system has proven a crucial part of our company’s overall
malware detection and prevention product and has
been deployed to our cloud security analytics platform,
which is currently performing detection on files streaming
from thousands of customer endpoints.

6. Software and Data 

The feature extraction code, data matrix, label
vector, and neural network code are available at

https://github.com/konstantinberlin/Malware-Detection-Using-Two-Dimensional-Binary-Program-Features